This notebook provides discussion, examples, and best practices for
working with netCDF datasets in Python using the xarray package.
Topics include:
This notebook is a companion to the 
Exploring netCDF Files
and
Exploring netCDF Datasets from ERDDAP
notebooks.
Those notebooks focus on using the
netcdf4-python package
to read netCDF datasets from local files and
ERDDAP servers on the Internet,
respectively.
This notebook is about using the xarray package
to work with netCDF datasets.
xarray uses the netcdf4-python package behind the scenes,
so datasets can be read from either local files or from ERDDAP servers.
xarray is a Python package that applies the concepts and tools for working with labeled data structures
from the pandas package to the physical sciences.
Whereas pandas excels at manipulating tablular data,
xarray brings similar power to working with N-dimensional arrays.
If you are already familiar with working with netCDF datasets via the netCDF4-python package,
you can think of xarray as a higher level Python tools for working with those dataset.
Creating netCDF files and working with their attribute metadata is documented elsewhere: http://salishsea-meopar-docs.readthedocs.org/en/latest/code-notes/salishsea-nemo/nemo-forcing/netcdf4.html.
This notebook assumes that you are working in Python 3. If you don't have a Python 3 environment set up, please see our Anaconda Python Distribution docs for instructions on how to set one up.
xarray and some of the packages that it depends on are not included in the default Anaconda
collection of packages,
so you may need to installed them explicitly:
$ conda install xarray netCDF4 bottleneck
bottleneck
is a package that speeds up NaN-skipping and rolling window aggregations.
If you are using a version of Python earlier than 3.5
(check with the command python --version),
you should also install cyordereddict
to speed internal operations with xarray data structures.
It is not required for Python ≥3.5 because collections.OrderedDict has been rewritten
in C,
making it even faster than cyordereddict.
Let's start with some imports. It's good Python form to keep all of our imports at the top of the file.
In [1]:
    
import numpy as np
import xarray as xr
    
Note that we alias numpy to np and xarray to xr
so that we don't have to type as much.
xarray provides a open_dataset function that allows us to load 
a netCDF dataset into a Python data structure by simply passing in
a file path/name,
or an ERDDAP server URL and dataset ID.
Let's explore the Salish Sea NEMO model bathymetry data:
In [3]:
    
ds = xr.open_dataset('https://salishsea.eos.ubc.ca/erddap/griddap/ubcSSnBathymetry2V1')
    
See the Exploring netCDF Datasets from ERDDAP notebook for more information about ERDDAP dataset URLs.
We could have opened the same dataset from a local file with:
ds = xr.open_dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')
In [4]:
    
lds = xr.open_dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')
    
Printing the string respresentation of the ds data structure that open_dataset()
returns gives us lots of information about the dataset and its metadata:
In [7]:
    
print(ds)
    
    
open_dataset() returns an xarray.Dataset object
that is xarray’s multi-dimensional equivalent of a pandas.DataFrame. 
It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions.
It is designed as an in-memory representation of the data model from the netCDF file format.
Dataset objects have four key properties:
dims: a dictionary mapping from dimension names to the fixed length of each dimension
(e.g., {'x': 6, 'y': 6, 'time': 8})data_vars: a dict-like container of DataArrays corresponding to variablescoords: another dict-like container of DataArrays intended to label points used in data_vars 
(e.g., arrays of numbers, datetime objects or strings)attrs: an OrderedDict to hold arbitrary metadataLet's look at them one at a time:
In [6]:
    
lds
    
    Out[6]:
In [9]:
    
ds.dims
    
    
In [11]:
    
ds.data_vars
    
    Out[11]:
In [12]:
    
ds.coords
    
    Out[12]:
So, we have a dataset that has 2 dimensions called gridX and gridY
of size 398 and 898, respectively,
3 variables called longitude, latitude, and bathymetry,
and 2 coordinates with the same names as the dimensions,
gridX and gridY.
The xarray docs have a 
good explanation and a diagram
about the distinction between coordinates and data variables.
If you are already familiar with working with netCDF datasets via the netCDF4-python package,
you will note that the dims and data_vars attributes provide similar information to that
produced by functions in the
SalishSeaTools.nc_tools module.
xarray provides a higher level Python interface to datasets.
We'll see how the dimensions and variables are related, and how to work with the data in the variables in a moment, but first, let's look at the dataset attributes:
In [15]:
    
ds.attrs
    
    Out[15]:
Dataset attributes are metadata. They tell us about the dataset as a whole: how, when, and by whom it was created, how it has been modified, etc. The meanings of the various attributes and the conventions for them that we use in the Salish Sea MEOPAR project are documented elsewhere.
Variables also have attributes :
In [18]:
    
ds.longitude
    
    Out[18]:
This tells us a whole lot of useful information about the longitude data values in our bathymetry dataset, for instance:
gridY and gridX dimensions, in that orderWe can access the attributes of the dataset variables using dotted notation:
In [21]:
    
ds.bathymetry.units, ds.bathymetry.long_name
    
    Out[21]:
Dataset variables are xarray.DataArray objects.
In addition to their attributes,
they carry a bunch of other useful properties and methods that you can read about in the
xarray docs.
Perhaps most importantly the data associated with the variables are stored as NumPy arrays. So, we can use NumPy indexing and slicing to access the data values. For instance, to get the latitudes and longitudes of the 4 corners of the domain:
In [31]:
    
ds.latitude.shape
    
    Out[31]:
In [28]:
    
print('Latitudes and longitudes of domain corners:')
pt = (0, 0)
print('  0, 0:        ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (0, ds.latitude.shape[1] - 1)
print('  0, x-max:    ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (ds.latitude.shape[0] - 1, 0)
print('  y-max, 0:    ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (ds.latitude.shape[0] - 1, ds.longitude.shape[1] - 1)
print('  y-max, x-max:', ds.latitude.values[pt], ds.longitude.values[pt])
    
    
You can also access the entire variable data array, or subsets of it using slicing.
In [34]:
    
ds.latitude.values
    
    Out[34]:
In [36]:
    
ds.longitude.values[42:45, 128:135]
    
    Out[36]:
In [37]:
    
ds.longitude.values[:2, :2], ds.latitude.values[-2:, -2:]
    
    Out[37]:
Note that the zero and maximum dimension values may be omitted for slices that extend to the ends of array dimensions.